Method for multi-time scale voltage quality control based on reinforcement learning in a power distribution network

ABSTRACT

A method for multi-time scale reactive voltage control based on reinforcement learning in a power distribution network is provided, which relates to the field of power system operation and control. The method includes: constituting an optimization model for multi-time scale reactive voltage control in a power distribution network based on a reactive voltage control object of a slow discrete device and a reactive voltage control object of a fast continuous device in the power distribution network; constructing a hierarchical interaction training framework based on a two-layer Markov decision process based on the model; setting a slow agent for the slow discrete device and setting a fast agent for the fast continuous device; and deciding action values of the controlled devices by each agent based on measurement information inputted, so as to realize the multi-time scale reactive voltage control while the slow agent and the fast agent perform continuous online learning.

FIELD

The disclosure relates to the technical field of power system operation and control, and more particularly to a method for multi-time scale reactive voltage control based on reinforcement learning in a power distribution network.

BACKGROUND

In recent years, “carbon emission peak” and “carbon neutrality” have become strategic goals of economic and social development, which place arduous demands on the energy system, and especially on the power system. With the rapid development of distributed renewable energy generation (DG) in the 21st century, the transition to clean and low-carbon energy may be further promoted, and the development of non-fossil energy, especially wind power, solar power and other new energy sources, may be accelerated.

As the penetration rate of DG in a power distribution network continues to increase, the operation of the power distribution network faces challenges, and refined control and optimization are becoming more and more important. To cope with the problems caused by the increasing penetration of distributed new energy, such as reverse power flow, voltage violations, power quality deterioration and device disconnection, the control capability of flexible resources needs to be exploited. A reactive voltage control system in the power distribution network has become a key component for improving the safe operation level of the power distribution network, reducing operating costs, and promoting consumption of the distributed resources. However, reactive voltage control systems currently applied in the field often adopt a model-driven optimization paradigm, i.e., they rely on accurate network models to formulate optimization problems and solve for control strategies. In engineering practice, the reliability of model parameters of the power distribution network is low, and the huge scale and frequent changes of the network result in high maintenance costs for the model. It is also difficult to accurately model the influences of external networks and the device characteristics. Because of the incomplete model, a conventional model-driven control system in a regional power grid faces major challenges such as inaccurate control and difficulty of implementation and promotion. Therefore, data-driven model-free optimization methods, especially the deep reinforcement learning methods that have developed rapidly in recent years, are important means for grid reactive voltage control.

However, controllable resources in the power distribution network are of various types with different characteristics, especially in time scale, which brings fundamental difficulties to data-driven methods and reinforcement learning methods. If a single time scale method is adopted, controllable resources will be wasted and the consumption of renewable energy may not be fully increased. For example, the installed capacity of a DG as a flexible resource is often greater than its rated active power, and its response speed is fast, so there is a large adjustable margin and continuous reactive power set values may be set quickly. In contrast, controllable resources such as on-load tap changers (OLTCs) and capacitor stations have a huge impact on the power distribution network, but they can only switch between fixed gears and thus produce discrete control actions. In addition, the interval between their actions is long, and each action incurs costs such as wear. As these two types of devices differ greatly in action nature and time scale, there is no good solution in the related art for coordinating and optimizing them under the condition of inaccurate models of the power distribution network. Generally, rough feedback methods are adopted, which makes it difficult to ensure the optimal operation of the power distribution network.

Therefore, it is necessary to study a method for multi-time scale reactive voltage control in a power distribution network, which can coordinate multi-time scale reactive power resources in the power distribution network for reactive voltage control, without requiring an accurate model for the power distribution network. Online learning of control process data achieves optimal reactive voltage control under the condition of incomplete models. At the same time, since the multi-time scale reactive voltage control in the power distribution network requires continuous online operation, it is necessary to ensure high safety, high efficiency, and high flexibility, so as to greatly improve the voltage quality of the power grid and reduce the network loss of the power grid.

SUMMARY

The purpose of the disclosure is to overcome the deficiencies in the related art and provide a method for multi-time scale reactive voltage control based on reinforcement learning in a power distribution network. The disclosure is particularly suitable for power distribution networks with serious problems caused by incomplete models. It not only saves the high cost of repeatedly maintaining accurate models, but also fully exploits the control capabilities of multi-time scale controllable resources, guarantees voltage safety and economic operation of the power distribution network to the greatest extent, and is suitable for large-scale promotion.

A method for multi-time scale reactive voltage control based on reinforcement learning in a power distribution network is provided in the disclosure. The method includes: determining a multi-time scale reactive voltage control object based on a reactive voltage control object of a slow discrete device and a reactive voltage control object of a fast continuous device in a controlled power distribution network, and establishing constraints for multi-time scale reactive voltage optimization, to constitute an optimization model for multi-time scale reactive voltage control in the power distribution network; constructing a hierarchical interaction training framework based on a two-layer Markov decision process based on the model; setting a slow agent for the slow discrete device and setting a fast agent for the fast continuous device; and performing online control with the slow agent and the fast agent, in which action values of the controlled devices are decided by each agent based on measurement information inputted, so as to realize the multi-time scale reactive voltage control while the slow agent and the fast agent perform continuous online learning and updating. The method includes the following steps.

1) determining the multi-time scale reactive voltage control object and establishing the constraints for multi-time scale reactive voltage optimization, to constitute the optimization model for multi-time scale reactive voltage control in the power distribution network comprises:

1-1) determining the multi-time scale reactive voltage control object of the controlled power distribution network:

$\begin{matrix} {O_{T} = {\min_{T_{O},T_{B}}{\sum\limits_{\overset{\sim}{t} = 0}^{\overset{\sim}{T} - 1}\left\lbrack {{C_{O}T_{O,{loss}}^{({k\overset{\sim}{t}})}} + {C_{B}T_{B,{loss}}^{({k\overset{\sim}{t}})}} + {C_{P}\min_{Q_{G},Q_{C}}{\sum\limits_{\tau = 0}^{k - 1}P_{loss}^{({{k\overset{\sim}{t}} + \tau})}}}} \right\rbrack}}} & (0.1) \end{matrix}$

where {tilde over (T)} is a number of control cycles of the slow discrete device in one day; k is an integer ratio of the number of control cycles of the fast continuous device to the number of control cycles of the slow discrete device in one day; T=k{tilde over (T)} is the number of control cycles of the fast continuous device in one day; {tilde over (t)} is an index of the control cycle of the slow discrete device; T_(O) is a gear of an on-load tap changer OLTC; T_(B) is a gear of a capacitor station; Q_(G) is a reactive power output of the distributed generation DG; Q_(C) is a reactive power output of a static var compensator SVC; C_(O),C_(B),C_(P) respectively are an OLTC adjustment cost, a capacitor station adjustment cost and an active power network loss cost; P_(loss) ^((k{tilde over (t)}+τ)) is a power distribution network loss at the moment k{tilde over (t)}+τ, τ being an integer, τ=0, 1, 2, . . . , k−1; T_(O,loss) ^((k{tilde over (t)})) is a gear change adjusted by the OLTC at the moment k{tilde over (t)}, and T_(B,loss) ^((k{tilde over (t)})) is a gear change adjusted by the capacitor station at the moment k{tilde over (t)}, which are respectively calculated by the following formulas:

$\begin{matrix} {\begin{array}{l} {T_{O,loss}^{(k\tilde{t})} = \sum\limits_{i = 1}^{n_{OLTC}}\left| T_{O,i}^{(k\tilde{t})} - T_{O,i}^{(k\tilde{t} - k)} \right|,\quad \tilde{t} > 0,\ i \in \left\lbrack 1,n_{OLTC} \right\rbrack} \\ {T_{B,loss}^{(k\tilde{t})} = \sum\limits_{i = 1}^{n_{CB}}\left| T_{B,i}^{(k\tilde{t})} - T_{B,i}^{(k\tilde{t} - k)} \right|,\quad \tilde{t} > 0,\ i \in \left\lbrack 1,n_{CB} \right\rbrack} \\ {T_{O,loss}^{(0)} = T_{B,loss}^{(0)} = 0} \end{array}} & (0.2) \end{matrix}$

where T_(O,i) ^((k{tilde over (t)})) is a gear set value of an i^(th) OLTC device at the moment k{tilde over (t)}, n_(OLTC) is a total number of OLTC devices; T_(B,i) ^((k{tilde over (t)})) is a gear set value of an i^(th) capacitor station at the moment k{tilde over (t)}, and n_(CB) is a total number of capacitor stations;

1-2) establishing the constraints for multi-time scale reactive voltage optimization in the controlled power distribution network:

voltage constraints and output constraints:

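$\begin{matrix} {\begin{array}{l} {\underline{V} \leq V_{i}^{(k\tilde{t} + \tau)} \leq \overline{V},} \\ {\left| Q_{Gi}^{(k\tilde{t} + \tau)} \right| \leq \sqrt{S_{Gi}^{2} - \left( P_{Gi}^{(k\tilde{t} + \tau)} \right)^{2}},} \\ {\underline{Q}_{Ci} \leq Q_{Ci}^{(k\tilde{t} + \tau)} \leq \overline{Q}_{Ci},} \\ {\forall i \in N,\ \tilde{t} \in \lbrack 0,\tilde{T}),\ \tau \in \lbrack 0,k)} \end{array}} & (0.3) \end{matrix}$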

where N is a set of all nodes in the power distribution network, V_(i) ^((k{tilde over (t)}+τ)) is a voltage magnitude of the node i at the moment k{tilde over (t)}+τ, $\underline{V}$, $\overline{V}$ are a lower limit and an upper limit of the node voltage respectively; Q_(Gi) ^((k{tilde over (t)}+τ)) is the DG reactive power output of the node i at the moment k{tilde over (t)}+τ; Q_(Ci) ^((k{tilde over (t)}+τ)) is the SVC reactive power output of the node i at the moment k{tilde over (t)}+τ; $\underline{Q}_{Ci}$, $\overline{Q}_{Ci}$ are a lower limit and an upper limit of the SVC reactive power output of the node i; S_(Gi) is a DG installed capacity of the node i; P_(Gi) ^((k{tilde over (t)}+τ)) is a DG active power output at the moment k{tilde over (t)}+τ of the node i;

adjustment constraints:

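$\begin{matrix} {\begin{array}{l} {1 \leq T_{O,i}^{(k\tilde{t})} \leq \overline{T}_{O,i},\quad \tilde{t} > 0,\ i \in \left\lbrack 1,n_{OLTC} \right\rbrack} \\ {1 \leq T_{B,i}^{(k\tilde{t})} \leq \overline{T}_{B,i},\quad \tilde{t} > 0,\ i \in \left\lbrack 1,n_{CB} \right\rbrack} \end{array}} & (0.4) \end{matrix}$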

where $\overline{T}_{O,i}$ is a number of gears of the i^(th) OLTC device, and $\overline{T}_{B,i}$ is a number of gears of the i^(th) capacitor station.
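As a non-limiting illustration of formulas (0.1) and (0.2), the following Python sketch evaluates the daily cost for one fixed gear schedule and one fixed series of network losses. It does not perform the minimization itself. The function names, the list layout, and all numbers are assumptions made for this example only.

```python
def gear_change(curr, prev):
    """Total gear movement between two consecutive slow steps, as in formula (0.2)."""
    return sum(abs(c - p) for c, p in zip(curr, prev))


def daily_cost(oltc_gears, cb_gears, p_loss, k, c_o, c_b, c_p):
    """Evaluate the bracketed sum of formula (0.1) for a given schedule.

    oltc_gears[t], cb_gears[t]: gear vectors at slow step t.
    p_loss: network loss at every fast step (k values per slow step).
    """
    n_slow = len(oltc_gears)
    assert len(cb_gears) == n_slow and len(p_loss) == k * n_slow
    total = 0.0
    for t in range(n_slow):
        # The adjustment cost is defined as zero at the first slow step.
        t_o = gear_change(oltc_gears[t], oltc_gears[t - 1]) if t > 0 else 0
        t_b = gear_change(cb_gears[t], cb_gears[t - 1]) if t > 0 else 0
        loss = sum(p_loss[k * t: k * t + k])
        total += c_o * t_o + c_b * t_b + c_p * loss
    return total


# Hypothetical data: 2 slow steps, k = 3 fast steps per slow step.
oltc = [[5], [6]]            # one OLTC moves up by one gear
cb = [[1, 2], [1, 4]]        # the second capacitor station moves by two gears
loss = [1.0, 1.0, 1.0, 0.5, 0.5, 0.5]
print(daily_cost(oltc, cb, loss, k=3, c_o=2.0, c_b=1.0, c_p=10.0))
```

In this example the loss term dominates the cost, so a gear change that lowers the loss in the following k fast steps can pay for its own adjustment cost.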

2) constructing the hierarchical interaction training framework based on the two-layer Markov decision process based on the optimization model established in step 1) and actual configuration of the power distribution network, comprises:

2-1) corresponding to system measurements of the power distribution network, constructing a state observation s at the moment t shown in the following formula:

s=(P, Q, V, T _(O) , T _(B))_(t)   (0.5)

where P, Q are vectors composed of active power injections and reactive power injections at respective nodes in the power distribution network respectively; V is a vector composed of respective node voltages in the power distribution network; T_(O) is a vector composed of respective OLTC gears, and T_(B) is a vector composed of respective capacitor station gears; t is a discrete time variable of the control process, (·)_(t) represents a value measured at the moment t;

2-2) corresponding to the multi-time scale reactive voltage optimization object, constructing feedback variable r_(f) of the fast continuous device shown in the following formula:

$\begin{matrix} {{r_{f} = {{{- C_{P}}{P_{loss}\left( s^{\prime} \right)}} - {C_{V}{V_{loss}\left( s^{\prime} \right)}}}}{{{P_{loss}\left( s^{\prime} \right)} = {\sum\limits_{i \in N}{P_{i}\left( s^{\prime} \right)}}},{{V_{loss}\left( s^{\prime} \right)} = \sqrt{\sum\limits_{i \in N}\left\lbrack {\left\lbrack {{V_{i}\left( s^{\prime} \right)} - \overset{¯}{V}} \right\rbrack_{+}^{2} + \left\lbrack {\underset{¯}{V} - {V_{i}\left( s^{\prime} \right)}} \right\rbrack_{+}^{2}} \right\rbrack}}}} & (0.6) \end{matrix}$

where s,a,s′ are a state observation at the moment t, an action of the fast continuous device at the moment t and a state observation at the moment t+1 respectively; P_(loss)(s′) is a network loss at the moment t+1; V_(loss)(s′) is a voltage deviation rate at the moment t+1; P_(i)(s′) is the active power output of the node i at the moment t+1; V_(i)(s′) is a voltage magnitude of the node i at the moment t+1; [x]₊=max(0,x); C_(V) is a cost coefficient of voltage violation probability;

2-3) corresponding to the multi-time scale reactive voltage optimization object, constructing feedback variable r_(s) of the slow discrete device shown in the following formula:

r _(s) =−C _(O) T _(O,loss)({tilde over (s)},{tilde over (s)}′)−C _(B) T _(B,loss)({tilde over (s)},{tilde over (s)}′)−R _(f)({s _(τ) ,a _(τ)|τ∈[0, k)},s _(k))   (0.7)

where {tilde over (s)},{tilde over (s)}′ are a state observation at the moment k{tilde over (t)} and a state observation at the moment k{tilde over (t)}+k respectively; T_(O,loss)({tilde over (s)},{tilde over (s)}′) is an OLTC adjustment cost generated by actions at the moment k{tilde over (t)}; T_(B,loss)({tilde over (s)},{tilde over (s)}′) is a capacitor station adjustment cost generated by actions at the moment k{tilde over (t)}; R_(f)({s_(τ),a_(τ)|τ∈[0,k)},s_(k)) is a feedback value of the fast continuous device accumulated between two actions of the slow discrete device, the calculation expression of which is as follows:

$\begin{matrix} {R_{f}\left( \left\{ s_{\tau},a_{\tau} \mid \tau \in \lbrack 0,k) \right\},s_{k} \right) = \sum\limits_{\tau = 0}^{k - 1}r_{f}\left( s_{\tau},a_{\tau},s_{\tau + 1} \right)} & (0.8) \end{matrix}$

2-4) constructing an action variable a_(t) of the fast agent and an action variable ã_(t) of the slow agent at the moment t shown in the following formula:

a _(t)=(Q _(G) , Q _(C))_(t)

ã _(t)=(T _(O) , T _(B))_(t)   (0.9)

where Q_(G), Q_(C) are vectors of the DG reactive power output and the SVC reactive power output in the power distribution network respectively;
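As a non-limiting illustration, the state observation of formula (0.5) and the two action variables of formula (0.9) may be held in simple containers such as the following Python sketch. The class names, field names, and the flattening order are assumptions made for this example and are not prescribed by the disclosure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Observation:
    """State observation s = (P, Q, V, T_O, T_B)_t of formula (0.5)."""
    p: List[float]      # active power injection per node
    q: List[float]      # reactive power injection per node
    v: List[float]      # voltage magnitude per node
    t_o: List[int]      # OLTC gears
    t_b: List[int]      # capacitor station gears

    def to_vector(self) -> List[float]:
        """Concatenate all measurements into one flat network input."""
        return [float(x) for x in self.p + self.q + self.v + self.t_o + self.t_b]


@dataclass
class FastAction:
    """Action (Q_G, Q_C)_t of the fast agent: continuous set values."""
    q_g: List[float]
    q_c: List[float]


@dataclass
class SlowAction:
    """Action (T_O, T_B)_t of the slow agent: discrete gears."""
    t_o: List[int]
    t_b: List[int]


# Hypothetical two-node network with one OLTC and one capacitor station.
s = Observation(p=[0.0, -0.3], q=[0.0, -0.1], v=[1.0, 0.97], t_o=[5], t_b=[2])
print(len(s.to_vector()))  # 2 + 2 + 2 + 1 + 1 entries
```

Both agents read the same kind of observation, while their action containers differ in type: floating-point set values for the fast continuous device and integer gears for the slow discrete device.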

3) setting the slow agent to control the slow discrete device and setting the fast agent to control the fast continuous device, comprises:

3-1) the slow agent is a deep neural network including a slow strategy network {tilde over (π)} and a slow evaluation network Q_(s) ^({tilde over (π)}), wherein an input of the slow strategy network {tilde over (π)} is {tilde over (s)}, an output is probability distribution of an action ã, and a parameter of the slow strategy network {tilde over (π)} is denoted as θ_(s); an input of the slow evaluation network Q_(s) ^({tilde over (π)}) is {tilde over (s)}, an output is an evaluation value of each action, and a parameter of the slow evaluation network Q_(s) ^({tilde over (π)}) is denoted as ϕ_(s);

3-2) the fast agent is a deep neural network including a fast strategy network π and a fast evaluation network Q_(f) ^(π), wherein an input of the fast strategy network π is s, an output is probability distribution of the action a, and a parameter of the fast strategy network π is denoted as θ_(f); an input of the fast evaluation network Q_(f) ^(π) is (s,a), an output is an evaluation value of actions, and a parameter of the fast evaluation network Q_(f) ^(π) is denoted as ϕ_(f);

4) initializing parameters:

4-1) randomly initializing parameters of the neural networks corresponding to respective agents θ_(s), θ_(f), ϕ_(s), ϕ_(f);

4-2) inputting a maximum entropy parameter α_(s) of the slow agent and a maximum entropy parameter α_(f) of the fast agent;

4-3) initializing the discrete time variable as t=0, an actual time interval between two steps of the fast agent is Δt, and an actual time interval between two steps of the slow agent is kΔt;

4-4) initializing an action probability of the fast continuous device as p=−1;

4-5) initializing cache experience database as D_(l)=∅ and initializing agent experience database as D=∅;

5) executing by the slow agent and the fast agent, the following control steps at the moment t:

5-1) judging if t mod k≠0: if yes, going to step 5-5) and if no, going to step 5-2);

5-2) obtaining by the slow agent, state information {tilde over (s)}′ from measurement devices in the power distribution network;

5-3) judging if D_(l)≠∅: if yes, calculating r_(s), adding an experience sample to D, updating D←D∪{({tilde over (s)},ã,r_(s),{tilde over (s)}′,D_(l))} and going to step 5-4); if no, directly going to step 5-4);

5-4) updating {tilde over (s)} to {tilde over (s)}′;

5-5) generating the action ã of the slow discrete device with the slow strategy network {tilde over (π)} of the slow agent according to the state information {tilde over (s)};

5-6) distributing ã to each slow discrete device to realize the reactive voltage control of each slow discrete device at the moment t;

5-7) obtaining by the fast agent, state information s′ from measurement devices in the power distribution network;

5-8) judging if p≥0: if yes, calculating r_(f), adding an experience sample to D_(l), updating D_(l)←D_(l)∪{(s,a,r_(f),s′,p)} and going to step 5-9); if no, directly going to step 5-9);

5-9) updating s to s′;

5-10) generating the action a of the fast continuous device with the fast strategy network π of the fast agent according to the state information s and updating p=π(a|s);

5-11) distributing a to each fast continuous device to realize the reactive voltage control of each fast continuous device at the moment t and going to step 6);
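As a non-limiting illustration of how the control steps of step 5) interleave, the following Python sketch runs the slow stage once every k moments and the fast stage at every moment. Several points are assumptions made for this example only: the slow action is held unchanged between slow moments, the cache D_l is emptied after it has been stored in D, and the policies, the measurement function and the feedback functions are hypothetical stand-ins passed in by the caller rather than the networks and formulas of the disclosure.

```python
import random


def run_control(num_steps, k, measure, slow_policy, fast_policy, r_s_fn, r_f_fn):
    """Interleave slow and fast control steps; returns the experience database D."""
    D, D_l = [], []                 # agent experience database, cache database
    s_slow = a_slow = None          # last slow state and slow action
    s = a = None                    # last fast state and fast action
    p = -1.0                        # p < 0 marks "no fast action taken yet"
    for t in range(num_steps):
        if t % k == 0:              # slow control stage
            s_slow_next = measure(t)
            if D_l:                 # cached fast samples exist
                r_s = r_s_fn(s_slow, s_slow_next, D_l)
                D.append((s_slow, a_slow, r_s, s_slow_next, list(D_l)))
                D_l.clear()         # assumption: the cache restarts per slow interval
            s_slow = s_slow_next
            a_slow = slow_policy(s_slow)
        s_next = measure(t)         # fast control stage, executed at every t
        if p >= 0:
            D_l.append((s, a, r_f_fn(s_next), s_next, p))
        s = s_next
        a, p = fast_policy(s)       # action and its probability pi(a|s)
    return D


random.seed(0)
D = run_control(
    num_steps=7, k=3,
    measure=lambda t: {"t": t},                          # stand-in measurement
    slow_policy=lambda s: random.choice([1, 2, 3]),      # stand-in gear choice
    fast_policy=lambda s: (random.uniform(-1, 1), 0.5),  # stand-in (action, probability)
    r_s_fn=lambda s0, s1, cache: sum(item[2] for item in cache),  # placeholder only
    r_f_fn=lambda s_next: -1.0,                          # placeholder only
)
print(len(D), [len(sample[4]) for sample in D])
```

Because the slow stage precedes the fast stage within one moment t, the fast sample that ends at a slow moment is written into the cache of the following slow interval in this sketch, so the first stored slow sample carries k−1 cached fast samples and later ones carry k.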

6) judging if t mod k=0: if yes, going to step 6-1); if no, going to step 7);

6-1) randomly selecting a set of experiences D^(B)⊂D from the agent experience database D, wherein a number of samples in the set of experiences is B;

6-2) calculating a loss function of the parameter ϕ_(s) with each sample in D^(B):

$\begin{matrix} {L\left( \phi_{s} \right) = \underset{\tilde{s},\tilde{a},r_{s},{\tilde{s}}^{\prime}}{E}\left\lbrack \left( Q_{s}^{\pi}\left( \tilde{s},\tilde{a} \right) - \omega y_{s} \right)^{2} \right\rbrack} & (0.10) \end{matrix}$

where

$\begin{matrix} {y_{s} = r_{s} + \gamma\left\lbrack Q_{s}^{\pi}\left( {\tilde{s}}^{\prime},{\tilde{a}}^{\prime} \right) - \alpha_{s}\log\tilde{\pi}\left( {\tilde{a}}^{\prime} \mid \tilde{s} \right) \right\rbrack} & (0.11) \end{matrix}$

where ã′˜{tilde over (π)}(·|{tilde over (s)}), γ is a discount factor, and ω is calculated by:

$\begin{matrix} {\omega = \prod\limits_{i = 0}^{k - 1}\frac{\pi\left( a_{i} \mid s_{i} \right)}{p\left( a_{i} \mid s_{i} \right)}} & (0.12) \end{matrix}$
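As a non-limiting illustration of formula (0.12), the following Python sketch computes ω from the probabilities p stored in the cache when the fast actions were taken and the probabilities that the current fast strategy network assigns to the same actions. The function name and the numbers are assumptions made for this example; summing logarithms instead of multiplying directly is an implementation choice of the sketch, not a requirement of the disclosure.

```python
import math


def interaction_factor(current_probs, stored_probs):
    """Product over the cached fast steps of pi(a_i|s_i) / p_i, as in formula (0.12).

    Computed through a sum of logarithms so that long products stay numerically stable.
    All probabilities must be strictly positive.
    """
    if len(current_probs) != len(stored_probs):
        raise ValueError("both lists must cover the same fast steps")
    log_w = sum(math.log(c) - math.log(p)
                for c, p in zip(current_probs, stored_probs))
    return math.exp(log_w)


# Hypothetical probabilities for k = 3 cached fast steps.
stored = [0.50, 0.40, 0.25]      # p recorded when the actions were taken
current = [0.50, 0.20, 0.50]     # probabilities under the current fast policy
print(interaction_factor(current, stored))
```

When the fast strategy network has not changed since the samples were collected, every ratio equals one and ω equals one, so the stored slow sample is used unmodified.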

6-3) updating the parameter ϕ_(s):

ϕ_(s)←ϕ_(s)−ρ_(s)∇_(ϕ) _(s) L(ϕ_(s))   (0.13)

where ρ_(s) is a learning step length of the slow discrete device;

6-4) calculating a loss function of the parameter θ_(s):

$\begin{matrix} {{L\left( \theta_{s} \right)} = {- {\underset{\overset{\sim}{s} \in D^{B}}{E}\left\lbrack {Q_{s}^{\pi}\left( {\overset{\sim}{s},{\overset{\sim}{a} \sim {{\overset{\sim}{\pi}}_{\theta_{s}}\left( {\cdot {❘\overset{\sim}{s}}} \right)}}} \right)} \right\rbrack}}} & (0.14) \end{matrix}$

6-5) updating the parameter θ_(s):

θ_(s)←θ_(s)−ρ_(s)∇_(θ) _(s) L(θ_(s))   (0.15)

and going to step 7);

7) executing by the fast agent, the following learning steps at the moment t:

7-1) randomly selecting a set of experiences D^(B)⊂D from the agent experience database D, wherein a number of samples in the set of experiences is B;

7-2) calculating a loss function of the parameter ϕ_(f) with each sample in D^(B):

$\begin{matrix} {{{L\left( \phi_{f} \right)} = {\underset{s,a,r_{f},s^{\prime}}{E}\left\lbrack \left( {{Q_{f}^{\pi}\left( {s,a} \right)} - y_{f}} \right)^{2} \right\rbrack}}{where}} & (0.16) \end{matrix}$ $\begin{matrix} {y_{f} = {r_{f} + {\gamma\left\lbrack {{Q_{f}^{\pi}\left( {s^{\prime},a^{\ \prime}} \right)} - {\alpha_{f}\log{\pi\left( {a^{\prime}{❘s}} \right)}}} \right\rbrack}}} & (0.17) \end{matrix}$

where a′˜π(·|s);

7-3) updating the parameter ϕ_(f):

ϕ_(f)←ϕ_(f)−ρ_(f)∇_(ϕ) _(f) L(ϕ_(f))   (0.18)

where ρ_(f) is a learning step length of the fast continuous device;

7-4) calculating a loss function of the parameter θ_(f):

$\begin{matrix} {L\left( \theta_{f} \right) = - \underset{s \in D^{B}}{E}\left\lbrack Q_{f}^{\pi}\left( s,a \sim \pi_{\theta_{f}}\left( \cdot \mid s \right) \right) \right\rbrack} & (0.19) \end{matrix}$

7-5) updating the parameter θ_(f):

θ_(f)←θ_(f)−ρ_(f)∇_(θ) _(f) L(θ_(f))   (0.20)
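As a non-limiting illustration of formulas (0.16) to (0.18), the following Python sketch computes the target value y_f, the squared-error loss of the fast evaluation network over a small batch, and one parameter update of the form ϕ←ϕ−ρ∇L. The evaluation network and the strategy network are replaced by precomputed numbers, and the function names, the batch values, and the gradient vector are assumptions made for this example only.

```python
def soft_target(r, q_next, log_prob_next, gamma, alpha):
    """y = r + gamma * (Q(s', a') - alpha * log pi(a'|.)), as in formula (0.17)."""
    return r + gamma * (q_next - alpha * log_prob_next)


def critic_loss(q_values, targets):
    """Batch mean of (Q(s, a) - y)^2, as in formula (0.16)."""
    return sum((q - y) ** 2 for q, y in zip(q_values, targets)) / len(q_values)


def gradient_step(params, grads, rho):
    """One update phi <- phi - rho * grad, as in formulas (0.18) and (0.20)."""
    return [w - rho * g for w, g in zip(params, grads)]


# Hypothetical batch of B = 2 samples.
rewards = [-1.0, -2.0]            # r_f of each sample
q_next = [-10.0, -8.0]            # stand-in for Q_f(s', a')
log_prob_next = [-0.5, -1.0]      # stand-in for log pi(a'|.)
targets = [soft_target(r, q, lp, gamma=0.5, alpha=0.2)
           for r, q, lp in zip(rewards, q_next, log_prob_next)]
q_values = [-6.0, -6.0]           # stand-in for Q_f(s, a)
print(targets, critic_loss(q_values, targets))
print(gradient_step([1.0, -2.0], [0.5, -0.5], rho=0.1))
```

The entropy term −α_f log π raises the target for actions that the strategy network considers unlikely, which is what keeps the fast agent exploring during continuous online learning.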

8) let t=t+1, returning to step 5).

The advantages and beneficial effects of the disclosure lie in:

The reactive voltage control problem in the disclosure is formulated as a two-layer Markov decision process, in which the slow agent is set for long-time scale devices (such as an OLTC, a capacitor station, etc.) and the fast agent is set for short-time scale devices (such as a DG, a static var compensator SVC, etc.). The agents are implemented by reinforcement learning algorithms and trained with the method for hierarchical reinforcement learning according to the present disclosure, so that they converge efficiently during interactive control, and each agent may independently decide the action value of the controlled device according to the inputted measurement information, thereby achieving the multi-time scale reactive voltage control. On the one hand, the controllable resources of the multi-time scale devices are fully used, and the fast and slow devices are fully decoupled in the control phase to perform fast and slow coordinated multi-time scale reactive voltage control. On the other hand, the proposed method for hierarchical reinforcement learning is highly efficient: a joint learning of the fast and slow agents is realized by interaction factors, mutual interferences between the fast and slow agents are avoided in the learning, historical data may be fully used for learning, and an optimal strategy of each agent is quickly obtained, thereby ensuring the optimal operation of the system with incomplete models.

1. Compared with the conventional method for multi-time-scale reactive voltage control, the method for multi-time scale reactive voltage control based on reinforcement learning in a power distribution network according to the disclosure has model-free characteristics, that is, the optimal control strategy may be obtained through online learning without requiring accurate models of the power distribution network. Further, the disclosure may avoid control deterioration caused by model errors, thereby ensuring the effectiveness of reactive voltage control, improving the efficiency and safety of power grid operation, and being suitable for deployment in actual power systems.

2. A fast agent and a slow agent in the disclosure are separately established for the fast continuous device and the slow discrete device. In the control phase, the two agents are fully decoupled, and multi-time scale control commands may be generated. Compared with the conventional method for reactive voltage control based on reinforcement learning, the method in the disclosure may ensure that the adjustment capabilities of the fast continuous device and the slow discrete device are used to the greatest extent, thereby fully optimizing the operation states of the power distribution network and improving the consumption of renewable energy.

3. In the learning process of the fast agent and the slow agent, the joint learning of multi-time-scale agents is realized in the disclosure by interaction factors, mutual interferences between the agents are avoided in the learning phase, and the fast convergence is realized in the reinforcement learning process. In addition, full mining of massive historical data is also supported with high sample efficiency and a fully optimized control strategy may be obtained after a few iterations in the disclosure, which is suitable for scenarios of lacking samples in the power distribution network.

4. The disclosure may realize continuous online operation of the multi-time scale reactive voltage control in the power distribution network, ensure the high safety, high efficiency and high flexibility of the operation, thereby greatly improving the voltage quality of the power grid, reducing the network loss of the power grid and having very high application value.

DETAILED DESCRIPTION

A method for multi-time scale reactive voltage control based on reinforcement learning in a power distribution network is provided in the disclosure. The method includes: determining a multi-time scale reactive voltage control object based on a reactive voltage control object of a slow discrete device and a reactive voltage control object of a fast continuous device in a controlled power distribution network, and establishing constraints for multi-time scale reactive voltage optimization, to constitute an optimization model for multi-time scale reactive voltage control in the power distribution network; constructing a hierarchical interaction training framework based on a two-layer Markov decision process based on the model; setting a slow agent for the slow discrete device and setting a fast agent for the fast continuous device; and performing online control with the slow agent and the fast agent, in which action values of the controlled devices are decided by each agent based on measurement information inputted, so as to realize the multi-time scale reactive voltage control while the slow agent and the fast agent perform continuous online learning and updating. The method includes the following steps.

1) according to a reactive voltage control object of a slow discrete device (which refers to a device that performs control by adjusting gears in an hour-level action cycle, such as an OLTC, a capacitor station, etc.) and a reactive voltage control object of a fast continuous device (which refers to a device that performs control by adjusting continuous set values in minute-level action cycle, such as a distributed generation DG, a static var compensator SVC, etc.) in the controlled power distribution network, the multi-time scale reactive voltage control object is determined, and optimization constraints are established for multi-time scale reactive voltage control, to constitute the optimization model for multi-time scale reactive voltage control in the power distribution network. The specific steps are as follows.

1-1) the multi-time scale reactive voltage control object of the controlled power distribution network is determined:

$\begin{matrix} {O_{T} = {\min_{T_{O},T_{B}}{\sum\limits_{\overset{\sim}{t} = 0}^{\overset{\sim}{T} - 1}\left\lbrack {{C_{O}T_{O,{loss}}^{({k\overset{\sim}{t}})}} + {C_{B}T_{B,{loss}}^{({k\overset{\sim}{t}})}} + {C_{P}\min_{Q_{G},Q_{C}}{\sum\limits_{\tau = 0}^{k - 1}P_{loss}^{({{k\overset{\sim}{t}} + \tau})}}}} \right\rbrack}}} & (0.21) \end{matrix}$

where {tilde over (T)} is a number of control cycles of the slow discrete device in one day; k is an integer which represents a multiple of a number of control cycles of the fast continuous device to the number of control cycles of the slow discrete device in one day; T=k{tilde over (T)} is the number of control cycles of the fast continuous device in one day; {tilde over (t)} is a number of control cycles of the slow discrete device; T_(O) is a gear of an on-load tap changer OLTC; T_(B) is a gear of a capacitor station; Q_(G) is a reactive power output of the distributed generation DG; Q_(C) is a reactive power output of a static var compensator SVC; C_(O),C_(B),C_(P) respectively are an OLTC adjustment cost, a capacitor station adjustment cost and an active power network loss cost; P_(loss) ^((k{tilde over (t)}+τ)) is a power distribution network loss at the moment k{tilde over (t)}+τ, τ being an integer, τ=0, 1, 2 . . . ,k−1; T_(O,loss) ^((k{tilde over (t)})) is a gear change adjusted by the OLTC at the moment k{tilde over (t)}, and T_(B,loss) ^((k{tilde over (t)})) is a gear change adjusted by the capacitor station at the moment k{tilde over (t)}, which are respectively calculated by the following formulas:

$\begin{matrix} {\begin{array}{l} {T_{O,loss}^{(k\tilde{t})} = \sum\limits_{i = 1}^{n_{OLTC}}\left| T_{O,i}^{(k\tilde{t})} - T_{O,i}^{(k\tilde{t} - k)} \right|,\quad \tilde{t} > 0,\ i \in \left\lbrack 1,n_{OLTC} \right\rbrack} \\ {T_{B,loss}^{(k\tilde{t})} = \sum\limits_{i = 1}^{n_{CB}}\left| T_{B,i}^{(k\tilde{t})} - T_{B,i}^{(k\tilde{t} - k)} \right|,\quad \tilde{t} > 0,\ i \in \left\lbrack 1,n_{CB} \right\rbrack} \\ {T_{O,loss}^{(0)} = T_{B,loss}^{(0)} = 0} \end{array}} & (0.22) \end{matrix}$

where T_(O,i) ^((k{tilde over (t)})) is a gear set value of an i^(th) OLTC device at the moment k{tilde over (t)}, n_(OLTC) is a total number of OLTC devices; T_(B,i) ^((k{tilde over (t)})) is a gear set value of an i^(th) capacitor station at the moment k{tilde over (t)}, and n_(CB) is a total number of capacitor stations;

1-2) the constraints are established for multi-time scale reactive voltage optimization in the controlled power distribution network:

the constraints for reactive voltage optimization are established according to actual conditions of the controlled power distribution network, including voltage constraints and output constraints expressed by:

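$\begin{matrix} {\begin{array}{l} {\underline{V} \leq V_{i}^{(k\tilde{t} + \tau)} \leq \overline{V},} \\ {\left| Q_{Gi}^{(k\tilde{t} + \tau)} \right| \leq \sqrt{S_{Gi}^{2} - \left( P_{Gi}^{(k\tilde{t} + \tau)} \right)^{2}},} \\ {\underline{Q}_{Ci} \leq Q_{Ci}^{(k\tilde{t} + \tau)} \leq \overline{Q}_{Ci},} \\ {\forall i \in N,\ \tilde{t} \in \lbrack 0,\tilde{T}),\ \tau \in \lbrack 0,k)} \end{array}} & (0.23) \end{matrix}$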

where N is a set of all nodes in the power distribution network, V_(i) ^((k{tilde over (t)}+τ)) is a voltage magnitude of the node i at the moment k{tilde over (t)}+τ, $\underline{V}$, $\overline{V}$ are a lower limit and an upper limit (the typical values are respectively 0.9 and 1.1) of the node voltage respectively; Q_(Gi) ^((k{tilde over (t)}+τ)) is the DG reactive power output of the node i at the moment k{tilde over (t)}+τ; Q_(Ci) ^((k{tilde over (t)}+τ)) is the SVC reactive power output of the node i at the moment k{tilde over (t)}+τ; $\underline{Q}_{Ci}$, $\overline{Q}_{Ci}$ are a lower limit and an upper limit of the SVC reactive power output of the node i; S_(Gi) is a DG installed capacity of the node i; P_(Gi) ^((k{tilde over (t)}+τ)) is a DG active power output at the moment k{tilde over (t)}+τ of the node i;

adjustment constraints are expressed by:

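$\begin{matrix} \begin{matrix} {1 \leq T_{O,i}^{(k\tilde{t}+\tau)} \leq \overline{T}_{O,i},\quad \tilde{t} > 0,\ i \in \left[ 1,n_{OLTC} \right]} \\ {1 \leq T_{B,i}^{(k\tilde{t}+\tau)} \leq \overline{T}_{B,i},\quad \tilde{t} > 0,\ i \in \left[ 1,n_{CB} \right]} \end{matrix} & (0.24) \end{matrix}$

where $\overline{T}_{O,i}$ is a number of gears of the i^(th) OLTC device, and $\overline{T}_{B,i}$ is a number of gears of the i^(th) capacitor station.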
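As an illustration of the voltage, output and adjustment constraints above, the following Python sketch evaluates each bound for a single node or device. All function names and numeric values are hypothetical. Clamping a requested set value onto its bound is an assumption of this sketch, since the text only states the constraints themselves.

```python
import math


def dg_reactive_limit(s_g: float, p_g: float) -> float:
    """Upper bound on |Q_G| from formula (0.23): sqrt(S_G^2 - P_G^2).

    The guard against a negative radicand (P_G above S_G) is an assumption.
    """
    return math.sqrt(max(s_g ** 2 - p_g ** 2, 0.0))


def clamp(value: float, lower: float, upper: float) -> float:
    """Project a requested set value onto the interval [lower, upper]."""
    return max(lower, min(upper, value))


def voltage_within_limits(v: float, v_min: float = 0.9, v_max: float = 1.1) -> bool:
    """Voltage constraint of formula (0.23) with the typical limits 0.9 and 1.1."""
    return v_min <= v <= v_max


def gear_within_limits(gear: int, n_gears: int) -> bool:
    """Adjustment constraint of formula (0.24): gears run from 1 to the gear count."""
    return 1 <= gear <= n_gears


# Hypothetical DG with installed capacity 5.0 and active output 3.0.
q_limit = dg_reactive_limit(5.0, 3.0)
q_g = clamp(4.5, -q_limit, q_limit)   # requested DG reactive output
q_c = clamp(-2.0, -1.0, 1.0)          # requested SVC output with limits [-1, 1]
```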

2) in combination with the optimization model established in step 1) and actual configuration of the power distribution network, the hierarchical interaction training framework based on the two-layer Markov decision process is constructed. The specific steps are as follows:

2-1) corresponding to system measurements of the power distribution network, a state observation s at the moment t is constructed in the following formula:

s=(P, Q, V, T _(O) , T _(B))_(t)   (0.25)

where P, Q are vectors composed of active power injections and reactive power injections at respective nodes in the power distribution network respectively; V is a vector composed of respective node voltages in the power distribution network; T_(O) is a vector composed of respective OLTC gears, and T_(B) is a vector composed of respective capacitor station gears; t is a discrete time variable of the control process, (·)_(t) represents a value measured at the moment t;

2-2) corresponding to the multi-time scale reactive voltage optimization object, feedback variable r_(f) of the fast continuous device is constructed in the following formula:

$\begin{matrix} \begin{matrix} {r_{f} = -C_{P}P_{loss}\left( s^{\prime} \right) - C_{V}V_{loss}\left( s^{\prime} \right),} \\ {P_{loss}\left( s^{\prime} \right) = \sum\limits_{i \in N} P_{i}\left( s^{\prime} \right),} \\ {V_{loss}\left( s^{\prime} \right) = \sqrt{\sum\limits_{i \in N}\left[ \left[ V_{i}\left( s^{\prime} \right) - \overline{V} \right]_{+}^{2} + \left[ \underline{V} - V_{i}\left( s^{\prime} \right) \right]_{+}^{2} \right]}} \end{matrix} & (0.26) \end{matrix}$

where s,a,s′ are a state observation at the moment t, an action of the fast continuous device at the moment t and a state observation at the moment t+1 respectively; P_(loss)(s′) is a network loss at the moment t+1; V_(loss)(s′) is a voltage deviation rate at the moment t+1; P_(i)(s′) is the active power output of the node i at the moment t+1; V_(i)(s′) is a voltage magnitude of the node i at the moment t+1; [x]₊=max(0, x); C_(V) is a cost coefficient of voltage violation probability;
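The feedback variable of formula (0.26) can be sketched in a few lines of Python. The function name `fast_reward`, the cost coefficients and the measured values below are hypothetical; only the formula itself is taken from the text.

```python
import math
from typing import Sequence


def fast_reward(p_inj: Sequence[float], v: Sequence[float],
                c_p: float, c_v: float,
                v_min: float = 0.9, v_max: float = 1.1) -> float:
    """Feedback r_f of formula (0.26), evaluated on the next observation s'."""
    # Network loss: sum of the nodal active powers P_i(s').
    p_loss = sum(p_inj)
    # Voltage deviation: only the part outside [v_min, v_max] counts, [x]_+ = max(0, x).
    v_loss = math.sqrt(sum(max(0.0, vi - v_max) ** 2 + max(0.0, v_min - vi) ** 2
                           for vi in v))
    return -c_p * p_loss - c_v * v_loss


# Hypothetical three-node example: one node above 1.1 and one below 0.9.
r_f = fast_reward(p_inj=[1.0, -0.6, -0.3], v=[1.0, 1.13, 0.86], c_p=1.0, c_v=10.0)
```

In this example the network loss is 0.1 and the voltage deviation is 0.05, so the feedback is approximately -0.6.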

2-3) corresponding to the multi-time scale reactive voltage optimization object, feedback variable r_(s) of the slow discrete device is constructed in the following formula:

r _(s) =−C _(O) T _(O,loss)({tilde over (s)}, {tilde over (s)}′)−C _(B) T _(B,loss)({tilde over (s)}, {tilde over (s)}′)−R _(f)({s _(τ) , a _(τ)|τ∈[0, k)}, s _(k))   (0.27)

where {tilde over (s)},{tilde over (s)}′ are a state observation at the moment k{tilde over (t)} and a state observation at the moment k{tilde over (t)}+k respectively; T_(O,loss)({tilde over (s)},{tilde over (s)}′) is an OLTC adjustment cost generated by actions at the moment k{tilde over (t)}; T_(B,loss)({tilde over (s)},{tilde over (s)}′) is a capacitor station adjustment cost generated by actions at the moment k{tilde over (t)}; R_(f)({s_(τ),a_(τ)|τ∈[0,k)},s_(k)) is a feedback value of the fast continuous device accumulated between two actions of the slow discrete device, which is calculated as follows:

$\begin{matrix} {R_{f}\left( \left\{ s_{\tau},a_{\tau} \mid \tau \in \left[ 0,k \right) \right\},s_{k} \right) = \sum\limits_{\tau = 0}^{k - 1} r_{f}\left( s_{\tau},a_{\tau},s_{\tau + 1} \right)} & (0.28) \end{matrix}$

2-4) corresponding to each adjustable resource, an action variable a_(t) of the fast agent and an action variable ã_(t) of the slow agent at the moment t are constructed in the following formula:

a _(t)=(Q _(G) , Q _(C))_(t)

ã _(t)=(T _(O) , T _(B))_(t)   (0.29)

where Q_(G), Q_(C) are vectors of the DG reactive power output and the SVC reactive power output in the power distribution network respectively;

3) the slow agent is set to control the slow discrete device and the fast agent is set to control the fast continuous device. The specific steps are as follows.

3-1) the slow agent is implemented by a deep neural network including a slow strategy network {tilde over (π)} and a slow evaluation network Q_(s) ^({tilde over (π)}).

3-1-1) the slow strategy network {tilde over (π)} is a deep neural network with an input being {tilde over (s)} and an output being probability distribution of an action ã, which includes several hidden layers (typically 2 hidden layers), each hidden layer having several neurons (typically 512 neurons), an activation function being the ReLU function, and a network parameter denoted as θ_(s);

3-1-2) the slow evaluation network Q_(s) ^({tilde over (π)}) is a deep neural network with an input being {tilde over (s)} and an output being an evaluation value of each action, which includes several hidden layers (typically 2 hidden layers), each hidden layer having several neurons (typically 512 neurons), an activation function being the ReLU function, and a network parameter denoted as ϕ_(s);

3-2) the fast agent is implemented by a deep neural network including a fast strategy network π and a fast evaluation network Q_(f) ^(π).

3-2-1) the fast strategy network π is a deep neural network with an input being s and an output being probability distribution of the action a, which includes several hidden layers (typically 2 hidden layers), each hidden layer having several neurons (typically 512 neurons), an activation function being the ReLU function, and a network parameter denoted as θ_(f);

3-2-2) the fast evaluation network Q_(f) ^(π) is a deep neural network with an input being (s,a) and an output being an evaluation value of actions, which includes several hidden layers (typically 2 hidden layers), each hidden layer having several neurons (typically 512 neurons), an activation function being the ReLU function, and a network parameter denoted as ϕ_(f);
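The shape of a strategy network with a discrete output, such as the slow strategy network of step 3-1-1), can be sketched in plain Python as follows. This is only an illustration of hidden ReLU layers followed by a probability output. The helper names, the random initialization, the softmax output layer and the encoding of the action ã as a single categorical choice are all assumptions of this sketch. A practical implementation would use a deep learning library and the typical width of 512 neurons.

```python
import math
import random
from typing import List, Sequence

Matrix = List[List[float]]


def init_layer(n_in: int, n_out: int, rng: random.Random) -> Matrix:
    """Random weights, one row per output neuron; the last column is the bias."""
    scale = 1.0 / math.sqrt(n_in)
    return [[rng.uniform(-scale, scale) for _ in range(n_in + 1)] for _ in range(n_out)]


def dense(w: Matrix, x: Sequence[float]) -> List[float]:
    """Affine layer: weighted sum of the inputs plus the bias of each neuron."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + row[-1] for row in w]


def relu(x: Sequence[float]) -> List[float]:
    return [max(0.0, v) for v in x]


def softmax(x: Sequence[float]) -> List[float]:
    m = max(x)  # subtract the maximum for numerical stability
    e = [math.exp(v - m) for v in x]
    total = sum(e)
    return [v / total for v in e]


def build_policy(n_state: int, n_actions: int, hidden=(512, 512), seed: int = 0) -> List[Matrix]:
    """Parameters of a policy with two hidden layers by default."""
    rng = random.Random(seed)
    sizes = [n_state, *hidden, n_actions]
    return [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]


def policy_probs(layers: List[Matrix], state: Sequence[float]) -> List[float]:
    """Forward pass: ReLU hidden layers, then a probability for each discrete action."""
    h = list(state)
    for w in layers[:-1]:
        h = relu(dense(w, h))
    return softmax(dense(layers[-1], h))


# Small hypothetical sizes keep the example fast.
layers = build_policy(n_state=4, n_actions=3, hidden=(8, 8))
probs = policy_probs(layers, [0.1, -0.2, 1.0, 0.95])
```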

4) the variables in the relevant control processes are initialized.

4-1) parameters of the neural networks corresponding to respective agents θ_(s), θ_(f), ϕ_(s), ϕ_(f) are randomly initialized;

4-2) a maximum entropy parameter α_(s) of the slow agent and a maximum entropy parameter α_(f) of the fast agent are input, which control the randomness of the slow agent and the fast agent respectively, a typical value of each being 0.01;

4-3) the discrete time variable is initialized as t=0, an actual time interval between two steps of the fast agent is Δt and an actual time interval between two steps of the slow agent is kΔt, which are determined according to the actual measurements of the local controller and the command control speed;

4-4) an action probability of the fast continuous device is initialized as p=−1;

4-5) experience databases are initialized, in which a cache experience database is initialized as D_(l)=∅ and an agent experience database is initialized as D=∅;

5) the slow agent and the fast agent execute the following control steps at the moment t:

5-1) it is judged whether t mod k≠0. If yes, step 5-5) is performed and if no, step 5-2) is performed;

5-2) the slow agent obtains state information {tilde over (s)}′ from measurement devices in the power distribution network;

5-3) it is judged whether D_(l)≠∅. If yes, r_(s) is calculated, an experience sample is added to D by the update D←D∪{({tilde over (s)},ã,r_(s),{tilde over (s)}′,D_(l))}, and step 5-4) is performed; if no, step 5-4) is directly performed;

5-4) let {tilde over (s)}←{tilde over (s)}′;

5-5) the action ã of the slow discrete device is generated with the slow strategy network {tilde over (π)} of the slow agent according to the state information {tilde over (s)};

5-6) ã is distributed to each slow discrete device to realize the reactive voltage control of each slow discrete device at the moment t;

5-7) the fast agent obtains state information s′ from measurement devices in the power distribution network;

5-8) it is judged whether p≥0. If yes, r_(f) is calculated, an experience sample is added to D_(l) by the update D_(l)←D_(l)∪{(s,a,r_(f),s′,p)}, and step 5-9) is performed; if no, step 5-9) is directly performed;

5-9) let s←s′;

5-10) the action a of the fast continuous device is generated with the fast strategy network π of the fast agent according to the state information s and p=π(a|s) is updated;

5-11) a is distributed to each fast continuous device to realize the reactive voltage control of each fast continuous device at the moment t and step 6) is performed;
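The bookkeeping of the two experience databases in steps 5-3) and 5-8) can be sketched as follows. The class and method names are hypothetical, and the reward values are passed in rather than computed. Emptying the cache D_(l) after it has been stored in D is an assumption of this sketch and is not stated in the text.

```python
from typing import Any, List, Tuple

FastSample = Tuple[Any, Any, float, Any, float]  # (s, a, r_f, s', p)


class TwoLevelExperience:
    """Cache experience database D_l and agent experience database D."""

    def __init__(self) -> None:
        self.cache: List[FastSample] = []  # D_l
        self.database: List[tuple] = []    # D

    def add_fast(self, s: Any, a: Any, r_f: float, s_next: Any, p: float) -> None:
        """Step 5-8): store a fast sample only if a previous action exists (p >= 0)."""
        if p >= 0:
            self.cache.append((s, a, r_f, s_next, p))

    def add_slow(self, s_slow: Any, a_slow: Any, r_s: float, s_slow_next: Any) -> None:
        """Step 5-3): store a slow sample together with the cached fast samples."""
        if self.cache:
            self.database.append((s_slow, a_slow, r_s, s_slow_next, list(self.cache)))
            self.cache.clear()  # assumption: start a fresh cache for the next slow cycle


# Hypothetical run over one slow control cycle.
buf = TwoLevelExperience()
buf.add_fast("s0", "a0", 0.0, "s1", p=-1.0)  # ignored: p is still at its initial value -1
buf.add_fast("s1", "a1", -0.2, "s2", p=0.4)
buf.add_fast("s2", "a2", -0.1, "s3", p=0.3)
buf.add_slow("S0", "A0", r_s=-1.5, s_slow_next="S1")
```

Storing the recorded probability p with each fast sample is what later allows the weight ω of formula (0.32) to be evaluated against the current fast strategy network.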

6) it is judged whether t mod k=0. If yes, step 6-1) is performed; if no, step 7) is performed;

6-1) a set of experiences D^(B)⊆D is randomly selected from the agent experience database D, wherein a number of samples in the set of experiences is B (a typical value is 64);

6-2) a loss function of the parameter ϕ_(s) is calculated with each sample in D^(B):

$\begin{matrix} {{L\left( \phi_{s} \right)} = {\underset{\overset{\sim}{s},\overset{\sim}{a},r_{s},{\overset{\sim}{s}}^{\prime}}{E}\left\lbrack \left( {{Q_{s}^{\pi}\left( {\overset{\sim}{s},\overset{\sim}{a}} \right)} - {\omega y_{s}}} \right)^{2} \right\rbrack}} & (0.30) \end{matrix}$

where

$\underset{\overset{\sim}{s},\overset{\sim}{a},r_{s},{\overset{\sim}{s}}^{\prime}}{E}$

is taken from D^(B) and y_(s) is determined by:

y _(s) =r _(s)+γ[Q _(s) ^(π)({tilde over (s)}′, ã′)−α_(s) log {tilde over (π)}(ã′|{tilde over (s)})]   (0.31)

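where ã′˜{tilde over (π)}(·|{tilde over (s)}) and γ is a discount factor, a typical value of which is 0.98; the weight ω is the product, over the k fast samples cached in D_(l), of the ratio between the current probability π(a_(i)|s_(i)) and the recorded probability p(a_(i)|s_(i)):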

$\begin{matrix} {\omega = {\prod\limits_{i = 0}^{k - 1}\frac{\pi\left( {a_{i}❘s_{i}} \right)}{p\left( {a_{i}❘s_{i}} \right)}}} & (0.32) \end{matrix}$
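Formulas (0.30) to (0.32) can be illustrated numerically with the short Python sketch below. The function names are hypothetical, and the evaluation values and log-probabilities are supplied as plain numbers instead of being produced by neural networks. The default values 0.01 and 0.98 are the typical values of α_(s) and γ given in the text.

```python
from typing import Sequence


def importance_weight(current_probs: Sequence[float],
                      recorded_probs: Sequence[float]) -> float:
    """Formula (0.32): product of pi(a_i|s_i) / p(a_i|s_i) over the cached fast samples."""
    weight = 1.0
    for pi_now, p_old in zip(current_probs, recorded_probs):
        weight *= pi_now / p_old
    return weight


def soft_target(reward: float, q_next: float, log_prob_next: float,
                alpha: float = 0.01, gamma: float = 0.98) -> float:
    """Formula (0.31): y = r + gamma * [Q(s', a') - alpha * log pi(a')]."""
    return reward + gamma * (q_next - alpha * log_prob_next)


def slow_critic_loss(q_values: Sequence[float], targets: Sequence[float],
                     weights: Sequence[float]) -> float:
    """Formula (0.30): mean over the batch of (Q - omega * y)^2."""
    terms = [(q - w * y) ** 2 for q, y, w in zip(q_values, targets, weights)]
    return sum(terms) / len(terms)


# Hypothetical single-sample batch with k = 2 cached fast samples.
omega = importance_weight(current_probs=[0.5, 0.2], recorded_probs=[0.25, 0.4])
y_s = soft_target(reward=-1.0, q_next=-10.0, log_prob_next=-2.0)
loss = slow_critic_loss(q_values=[-10.0], targets=[y_s], weights=[omega])
```

Here the two probability ratios are 2 and 0.5, so ω equals 1 and the loss reduces to the ordinary squared error between the evaluation value and y_(s).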

6-3) the parameter ϕ_(s) is updated:

ϕ_(s)←ϕ_(s)−ρ_(s)∇_(ϕ) _(s) L(ϕ_(s))   (0.33)

where ρ_(s) is a learning step length of the slow discrete device, a typical value of which is 0.0001;

6-4) a loss function of the parameter θ_(s) is calculated:

$\begin{matrix} {{L\left( \theta_{s} \right)} = {- {\underset{\overset{\sim}{s} \in D^{B}}{E}\left\lbrack {Q_{s}^{\pi}\left( {\overset{\sim}{s},{\left. \overset{\sim}{a} \right.\sim{{\overset{\sim}{\pi}}_{\theta_{s}}\left( {\cdot {❘\overset{\sim}{s}}} \right)}}} \right)} \right\rbrack}}} & (0.34) \end{matrix}$

6-5) the parameter θ_(s) is updated:

θ_(s)←θ_(s)−ρ_(s)∇_(θ) _(s) L(θ_(s))   (0.35)

and step 7) is then performed;

7) the fast agent executes the following learning steps at the moment t:

7-1) a set of experiences D^(B)⊆D is randomly selected from the agent experience database D, wherein a number of samples in the set of experiences is B (a typical value is 64);

7-2) a loss function of the parameter ϕ_(f) is calculated with each sample in D^(B):

$\begin{matrix} {{L\left( \phi_{f} \right)} = {\underset{s,a,r_{f},s^{\prime}}{E}\left\lbrack \left( {{Q_{f}^{\pi}\left( {s,a} \right)} - y_{f}} \right)^{2} \right\rbrack}} & (0.36) \end{matrix}$

where

$\underset{s,a,r_{f},s^{\prime}}{E}$

is taken from D^(B) and y_(f) is determined by:

y _(f) =r _(f)+γ[Q _(f) ^(π)(s′,a′)−α_(f) log π(a′|s)]  (0.37)

where a′˜π(·|s);

7-3) the parameter ϕ_(f) is updated:

ϕ_(f)←ϕ_(f)−ρ_(f)∇_(ϕ) _(f) L(ϕ_(f))   (0.38)

where ρ_(f) is a learning step length of the fast continuous device, a typical value of which is 0.00001;
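The update rule above is a plain gradient descent step on the squared-error loss of formula (0.36). The Python sketch below shows one such step for a single sample. To keep the gradient explicit, a linear evaluation function is swapped in for the deep evaluation network, which is an assumption of this sketch; the function name, feature vector and step length are likewise hypothetical.

```python
from typing import List, Sequence


def critic_step(phi: Sequence[float], features: Sequence[float],
                target: float, rho: float) -> List[float]:
    """One step phi <- phi - rho * grad L(phi) for L = (Q(s, a) - y)^2.

    Q(s, a) is modelled as a dot product of phi with a feature vector of (s, a).
    """
    q = sum(p * x for p, x in zip(phi, features))
    error = q - target
    grad = [2.0 * error * x for x in features]  # dL/dphi for the squared error
    return [p - rho * g for p, g in zip(phi, grad)]


# Hypothetical sample: the step moves the evaluation value toward the target y_f.
phi_new = critic_step(phi=[0.0, 0.0], features=[1.0, 2.0], target=1.0, rho=0.1)
```

With these numbers the evaluation value rises from 0 to 1.0 in a single step; the much smaller typical step length of 0.00001 makes each update correspondingly more gradual.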

7-4) a loss function of the parameter θ_(f) is calculated:

$\begin{matrix} {{L\left( \theta_{f} \right)} = {- {\underset{s \in D^{B}}{E}\left\lbrack {Q_{f}^{\pi}\left( {s,{\left. a \right.\sim{\pi_{\theta_{f}}\left( {\cdot {❘s}} \right)}}} \right)} \right\rbrack}}} & (0.39) \end{matrix}$

7-5) the parameter θ_(f) is updated:

θ_(f)←θ_(f)−ρ_(f)∇_(θ) _(f) L(θ_(f))   (0.40)

8) let t=t+1, and the method returns to step 5) to repeat steps 5) to 8). The method is an online learning control method: it runs continuously online, updating the neural networks while performing online control, until the user manually stops it.

What is claimed is:
 1. A method for multi-time scale reactive voltage control based on reinforcement learning in a power distribution network, comprising: determining a multi-time scale reactive voltage control object based on a reactive voltage control object of a slow discrete device and a reactive voltage control object of a fast continuous device in a controlled power distribution network, and establishing constraints for multi-time scale reactive voltage optimization, to constitute an optimization model for multi-time scale reactive voltage control in the power distribution network; constructing a hierarchical interaction training framework based on a two-layer Markov decision process based on the model; setting a slow agent for the slow discrete device and setting a fast agent for the fast continuous device; and performing online control with the slow agent and the fast agent, in which action values of the controlled devices are decided by each agent based on measurement information inputted, so as to realize the multi-time scale reactive voltage control while the slow agent and the fast agent perform continuous online learning and updating.
 2. The method as claimed in claim 1, wherein:

1) determining the multi-time scale reactive voltage control object and establishing the constraints for multi-time scale reactive voltage optimization, to constitute the optimization model for multi-time scale reactive voltage control in the power distribution network, comprises:

1-1) determining the multi-time scale reactive voltage control object of the controlled power distribution network:

$\begin{matrix} {O_{T} = \min\limits_{T_{O},T_{B}} \sum\limits_{\tilde{t} = 0}^{\tilde{T} - 1}\left[ C_{O}T_{O,loss}^{(k\tilde{t})} + C_{B}T_{B,loss}^{(k\tilde{t})} + C_{P}\min\limits_{Q_{G},Q_{C}} \sum\limits_{\tau = 0}^{k - 1} P_{loss}^{(k\tilde{t}+\tau)} \right]} & (0.41) \end{matrix}$

where {tilde over (T)} is a number of control cycles of the slow discrete device in one day; k is an integer which represents a multiple of a number of control cycles of the fast continuous device to the number of control cycles of the slow discrete device in one day; T=k{tilde over (T)} is the number of control cycles of the fast continuous device in one day; {tilde over (t)} is a number of control cycles of the slow discrete device; T_(O) is a gear of an on-load tap changer OLTC; T_(B) is a gear of a capacitor station; Q_(G) is a reactive power output of the distributed generation DG; Q_(C) is a reactive power output of a static var compensator SVC; C_(O),C_(B),C_(P) respectively are an OLTC adjustment cost, a capacitor station adjustment cost and an active power network loss cost; P_(loss) ^((k{tilde over (t)}+τ)) is a power distribution network loss at the moment k{tilde over (t)}+τ, τ being an integer, τ=0, 1, 2, . . . , k−1; and T_(O,loss) ^((k{tilde over (t)})) is a gear change adjusted by the OLTC at the moment k{tilde over (t)}, and T_(B,loss) ^((k{tilde over (t)})) is a gear change adjusted by the capacitor station at the moment k{tilde over (t)}, which are respectively calculated by the following formulas:

$\begin{matrix} \begin{matrix} {T_{O,loss}^{(k\tilde{t})} = \sum\limits_{i=1}^{n_{OLTC}} \left| T_{O,i}^{(k\tilde{t})} - T_{O,i}^{(k\tilde{t}-k)} \right|,\quad \tilde{t} > 0,\ i \in \left[ 1,n_{OLTC} \right]} \\ {T_{B,loss}^{(k\tilde{t})} = \sum\limits_{i=1}^{n_{CB}} \left| T_{B,i}^{(k\tilde{t})} - T_{B,i}^{(k\tilde{t}-k)} \right|,\quad \tilde{t} > 0,\ i \in \left[ 1,n_{CB} \right]} \\ {T_{O,loss}^{(0)} = T_{B,loss}^{(0)} = 0} \end{matrix} & (0.42) \end{matrix}$

where T_(O,i) ^((k{tilde over (t)})) is a gear set value of an i^(th) OLTC device at the moment k{tilde over (t)}, n_(OLTC) is a total number of OLTC devices; T_(B,i) ^((k{tilde over (t)})) is a gear set value of an i^(th) capacitor station at the moment k{tilde over (t)}, and n_(CB) is a total number of capacitor stations;

1-2) establishing the constraints for multi-time scale reactive voltage optimization in the controlled power distribution network, which include:

voltage constraints and output constraints:

$\begin{matrix} \begin{matrix} {\underline{V} \leq V_{i}^{(k\tilde{t}+\tau)} \leq \overline{V},} \\ {\left| Q_{Gi}^{(k\tilde{t}+\tau)} \right| \leq \sqrt{S_{Gi}^{2} - \left( P_{Gi}^{(k\tilde{t}+\tau)} \right)^{2}},} \\ {\underline{Q}_{Ci} \leq Q_{Ci}^{(k\tilde{t}+\tau)} \leq \overline{Q}_{Ci},} \\ {\forall i \in N,\ \tilde{t} \in \left[ 0,T \right),\ \tau \in \left[ 0,k \right)} \end{matrix} & (0.43) \end{matrix}$

where N is a set of all nodes in the power distribution network; V_(i) ^((k{tilde over (t)}+τ)) is a voltage magnitude of the node i at the moment k{tilde over (t)}+τ; $\underline{V}$, $\overline{V}$ are a lower limit and an upper limit of the node voltage respectively; Q_(Gi) ^((k{tilde over (t)}+τ)) is the DG reactive power output of the node i at the moment k{tilde over (t)}+τ; Q_(Ci) ^((k{tilde over (t)}+τ)) is the SVC reactive power output of the node i at the moment k{tilde over (t)}+τ; $\underline{Q}_{Ci}$, $\overline{Q}_{Ci}$ are a lower limit and an upper limit of the SVC reactive power output of the node i respectively; S_(Gi) is a DG installed capacity of the node i; P_(Gi) ^((k{tilde over (t)}+τ)) is a DG active power output of the node i at the moment k{tilde over (t)}+τ;

adjustment constraints:

$\begin{matrix} \begin{matrix} {1 \leq T_{O,i}^{(k\tilde{t})} \leq \overline{T}_{O,i},\quad \tilde{t} > 0,\ i \in \left[ 1,n_{OLTC} \right]} \\ {1 \leq T_{B,i}^{(k\tilde{t})} \leq \overline{T}_{B,i},\quad \tilde{t} > 0,\ i \in \left[ 1,n_{CB} \right]} \end{matrix} & (0.44) \end{matrix}$

where $\overline{T}_{O,i}$ is a number of gears of the i^(th) OLTC device, and $\overline{T}_{B,i}$ is a number of gears of the i^(th) capacitor station;

2) constructing the hierarchical interaction training framework based on the two-layer Markov decision process based on the optimization model established in step 1) and actual configuration of the power distribution network, comprises:

2-1) corresponding to system measurements of the power distribution network, constructing a state observation s at the moment t shown in the following formula:

s=(P, Q, V, T _(O) , T _(B))_(t)   (0.45)

where P, Q are vectors composed of active power injections and reactive power injections at respective nodes in the power distribution network respectively; V is a vector composed of respective node voltages in the power distribution network; T_(O) is a vector composed of respective OLTC gears, and T_(B) is a vector composed of respective capacitor station gears; t is a discrete time variable of the control process, (·)_(t) represents a value measured at the moment t;

2-2) corresponding to the multi-time scale reactive voltage optimization object, constructing feedback variable r_(f) of the fast continuous device shown in the following formula:

$\begin{matrix} \begin{matrix} {r_{f} = -C_{P}P_{loss}\left( s^{\prime} \right) - C_{V}V_{loss}\left( s^{\prime} \right),} \\ {P_{loss}\left( s^{\prime} \right) = \sum\limits_{i \in N} P_{i}\left( s^{\prime} \right),} \\ {V_{loss}\left( s^{\prime} \right) = \sqrt{\sum\limits_{i \in N}\left[ \left[ V_{i}\left( s^{\prime} \right) - \overline{V} \right]_{+}^{2} + \left[ \underline{V} - V_{i}\left( s^{\prime} \right) \right]_{+}^{2} \right]}} \end{matrix} & (0.46) \end{matrix}$

where s,a,s′ are a state observation at the moment t, an action of the fast continuous device at the moment t and a state observation at the moment t+1 respectively; P_(loss)(s′) is a network loss at the moment t+1; V_(loss)(s′) is a voltage deviation rate at the moment t+1; P_(i)(s′) is the active power output of the node i at the moment t+1; V_(i)(s′) is a voltage magnitude of the node i at the moment t+1; [x]₊=max(0, x); C_(V) is a cost coefficient of voltage violation probability;

2-3) corresponding to the multi-time scale reactive voltage optimization object, constructing feedback variable r_(s) of the slow discrete device shown in the following formula:

r _(s) =−C _(O) T _(O,loss)({tilde over (s)},{tilde over (s)}′)−C _(B) T _(B,loss)({tilde over (s)},{tilde over (s)}′)−R _(f)({s _(τ) , a _(τ)|τ∈[0, k)},s _(k))   (0.47)

where {tilde over (s)},{tilde over (s)}′ are a state observation at the moment k{tilde over (t)} and a state observation at the moment k{tilde over (t)}+k respectively; T_(O,loss)({tilde over (s)},{tilde over (s)}′) is an OLTC adjustment cost generated by actions at the moment k{tilde over (t)}; T_(B,loss)({tilde over (s)},{tilde over (s)}′) is a capacitor station adjustment cost generated by actions at the moment k{tilde over (t)}; R_(f)({s_(τ),a_(τ)|τ∈[0,k)},s_(k)) is a feedback value of the fast continuous device accumulated between two actions of the slow discrete device, which is calculated as follows:

$\begin{matrix} {R_{f}\left( \left\{ s_{\tau},a_{\tau} \mid \tau \in \left[ 0,k \right) \right\},s_{k} \right) = \sum\limits_{\tau = 0}^{k - 1} r_{f}\left( s_{\tau},a_{\tau},s_{\tau + 1} \right)} & (0.48) \end{matrix}$

2-4) constructing an action variable a_(t) of the fast agent and an action variable ã_(t) of the slow agent at the moment t shown in the following formula:

a _(t)=(Q _(G) , Q _(C))_(t)

ã _(t)=(T _(O) , T _(B))_(t)   (0.49)

where Q_(G),Q_(C) are vectors of the DG reactive power output and the SVC reactive power output in the power distribution network respectively;

3) setting the slow agent to control the slow discrete device and setting the fast agent to control the fast continuous device, comprises:

3-1) the slow agent is a deep neural network including a slow strategy network {tilde over (π)} and a slow evaluation network Q_(s) ^({tilde over (π)}), wherein an input of the slow strategy network {tilde over (π)} is {tilde over (s)}, an output is probability distribution of an action ã, and a parameter of the slow strategy network {tilde over (π)} is denoted as θ_(s); an input of the slow evaluation network Q_(s) ^({tilde over (π)}) is {tilde over (s)}, an output is an evaluation value of each action, and a parameter of the slow evaluation network Q_(s) ^({tilde over (π)}) is denoted as ϕ_(s);

3-2) the fast agent is a deep neural network including a fast strategy network π and a fast evaluation network Q_(f) ^(π), wherein an input of the fast strategy network π is s, an output is probability distribution of the action a, and a parameter of the fast strategy network π is denoted as θ_(f); an input of the fast evaluation network Q_(f) ^(π) is (s,a), an output is an evaluation value of actions, and a parameter of the fast evaluation network Q_(f) ^(π) is denoted as ϕ_(f);

4) initializing parameters:

4-1) randomly initializing parameters of the neural networks corresponding to respective agents θ_(s), θ_(f), ϕ_(s), ϕ_(f);

4-2) inputting a maximum entropy parameter α_(s) of the slow agent and a maximum entropy parameter α_(f) of the fast agent;
4-3) initializing the discrete time variable as t=0, wherein an actual time interval between two steps of the fast agent is Δt, and an actual time interval between two steps of the slow agent is kΔt;

4-4) initializing an action probability of the fast continuous device as p=−1;

4-5) initializing a cache experience database as D_(l)=∅ and initializing an agent experience database as D=∅;

5) executing by the slow agent and the fast agent, the following control steps at the moment t:

5-1) judging if t mod k≠0: if yes, going to step 5-5) and if no, going to step 5-2);

5-2) obtaining by the slow agent, state information {tilde over (s)}′ from measurement devices in the power distribution network;

5-3) judging if D_(l)≠∅: if yes, calculating r_(s), adding an experience sample to D by updating D←D∪{({tilde over (s)},ã,r_(s),{tilde over (s)}′,D_(l))} and going to step 5-4); if no, directly going to step 5-4);

5-4) updating {tilde over (s)} to {tilde over (s)}′;

5-5) generating the action ã of the slow discrete device with the slow strategy network {tilde over (π)} of the slow agent according to the state information {tilde over (s)};

5-6) distributing ã to each slow discrete device to realize the reactive voltage control of each slow discrete device at the moment t;

5-7) obtaining by the fast agent, state information s′ from measurement devices in the power distribution network;

5-8) judging if p≥0: if yes, calculating r_(f), adding an experience sample to D_(l) by updating D_(l)←D_(l)∪{(s,a,r_(f),s′,p)}, and going to step 5-9); if no, directly going to step 5-9);

5-9) updating s to s′;

5-10) generating the action a of the fast continuous device with the fast strategy network π of the fast agent according to the state information s and updating p=π(a|s);

5-11) distributing a to each fast continuous device to realize the reactive voltage control of each fast continuous device at the moment t and going to step 6);

6) judging if t mod k=0: if yes, going to step 6-1); if no, going to step 7);

6-1) randomly
selecting a set of experiences D^(B)⊆D from the agent experience database D, wherein a number of samples in the set of experiences is B;

6-2) calculating a loss function of the parameter ϕ_(s) with each sample in D^(B):

$\begin{matrix} {L\left( \phi_{s} \right) = \underset{\tilde{s},\tilde{a},r_{s},\tilde{s}^{\prime}}{E}\left[ \left( Q_{s}^{\pi}\left( \tilde{s},\tilde{a} \right) - \omega y_{s} \right)^{2} \right]} & (0.50) \end{matrix}$

where

$\begin{matrix} {y_{s} = r_{s} + \gamma\left[ Q_{s}^{\pi}\left( \tilde{s}^{\prime},\tilde{a}^{\prime} \right) - \alpha_{s}\log\tilde{\pi}\left( \tilde{a}^{\prime} \mid \tilde{s} \right) \right]} & (0.51) \end{matrix}$

where ã′˜{tilde over (π)}(·|{tilde over (s)}) and γ is a discount factor;

$\begin{matrix} {\omega = \prod\limits_{i = 0}^{k - 1}\frac{\pi\left( a_{i} \mid s_{i} \right)}{p\left( a_{i} \mid s_{i} \right)}} & (0.52) \end{matrix}$

6-3) updating the parameter ϕ_(s):

ϕ_(s)←ϕ_(s)−ρ_(s)∇_(ϕ) _(s) L(ϕ_(s))   (0.53)

where ρ_(s) is a learning step length of the slow discrete device;

6-4) calculating a loss function of the parameter θ_(s):

$\begin{matrix} {L\left( \theta_{s} \right) = - \underset{\tilde{s} \in D^{B}}{E}\left[ Q_{s}^{\pi}\left( \tilde{s},\tilde{a} \sim \tilde{\pi}_{\theta_{s}}\left( \cdot \mid \tilde{s} \right) \right) \right]} & (0.54) \end{matrix}$

6-5) updating the parameter θ_(s):

θ_(s)←θ_(s)−ρ_(s)∇_(θ) _(s) L(θ_(s))   (0.55)

and going to step 7);

7) executing by the fast agent, the following learning steps at the moment t:

7-1) randomly selecting a set of experiences D^(B)⊆D from the agent experience database D, wherein a number of samples in the set of experiences is B;

7-2) calculating a loss function of the parameter ϕ_(f) with each sample in D^(B):

$\begin{matrix} {L\left( \phi_{f} \right) = \underset{s,a,r_{f},s^{\prime}}{E}\left[ \left( Q_{f}^{\pi}\left( s,a \right) - y_{f} \right)^{2} \right]} & (0.56) \end{matrix}$

where

$\begin{matrix} {y_{f} = r_{f} + \gamma\left[ Q_{f}^{\pi}\left( s^{\prime},a^{\prime} \right) - \alpha_{f}\log\pi\left( a^{\prime} \mid s \right) \right]} & (0.57) \end{matrix}$

where a′˜π(·|s);

7-3) updating the parameter ϕ_(f):

ϕ_(f)←ϕ_(f)−ρ_(f)∇_(ϕ) _(f) L(ϕ_(f))   (0.58)

where ρ_(f) is a learning step length of the fast continuous device;

7-4) calculating a loss function of the parameter θ_(f):

$\begin{matrix} {L\left( \theta_{f} \right) = - \underset{s \in D^{B}}{E}\left[ Q_{f}^{\pi}\left( s,a \sim \pi_{\theta_{f}}\left( \cdot \mid s \right) \right) \right]} & (0.59) \end{matrix}$

7-5) updating the parameter θ_(f):

θ_(f)←θ_(f)−ρ_(f)∇_(θ) _(f) L(θ_(f))   (0.60)

8) letting t=t+1 and returning to step 5).